Addressing Internet of Things security by enhanced sine cosine metaheuristics tuned hybrid machine learning model and results interpretation based on SHAP approach

An ever-increasing number of electronic devices integrated into the Internet of Things (IoT) generates vast amounts of data, which are transported via the network and stored for further analysis. However, besides the undisputed advantages of this technology, it also brings risks of unauthorized access and data compromise, situations where machine learning (ML) and artificial intelligence (AI) can help with detection of potential threats and intrusions and with automation of the diagnostic process. The effectiveness of the applied algorithms largely depends on the previously performed optimization, i.e., predetermined values of hyperparameters and training conducted to achieve the desired result. Therefore, to address the very important issue of IoT security, this article proposes an AI framework based on a simple convolutional neural network (CNN) and an extreme learning machine (ELM) tuned by a modified sine cosine algorithm (SCA). Notwithstanding that many methods for addressing security issues have been developed, there is always a possibility for further improvement, and the proposed research tries to fill this gap. The introduced framework was evaluated on two ToN IoT intrusion detection datasets that consist of network traffic data generated in Windows 7 and Windows 10 environments. The analysis of the results suggests that the proposed model achieved a superior level of classification performance for the observed datasets. Additionally, besides conducting rigorous statistical tests, the best derived model is interpreted by SHapley Additive exPlanations (SHAP) analysis, and the findings can be used by security experts to further enhance the security of IoT systems.


INTRODUCTION
During the first two industrial revolutions, production was mechanized, first by the introduction of steam power, then by electrification, and eventually mass production was  The Domain of Industry 4.0 includes Cyber-Physical Systems (CPS), IoT, Industrial IoT (IIoT), AI, big data, digital twins and other technologies without which contemporary factories could not exist (Anbesh et al., 2021). The United Nations' (UN) Sustainability 2030 agenda highlights the production efficiency with minimal use of resources as the core of the future business strategy, including smart production and industrialization with a low impact on the living environment (Stock & Seliger, 2016).
In addition to the intensive application of information technologies, the progress of the 4th industrial revolution is backed by recent significant advances in the fields of ML, Computer Vision (CV) and AI, as well as in manufacturing technologies. 3D printing, for example, enables rapid production of necessary components and parts. This progress supported the transition from traditional factories to the smart factory concept (Oztemel & Gursev, 2020). The described technologies have enabled the emergence of flexible production lines based on CPS in various branches of industry. The basic characteristics of such production lines are modularity and interchangeability, which provide mass production capabilities in accordance with the individual needs of the customer (Zheng et al., 2021).
At the moment, most of the research in the field of smart factories is focused on the planning and sustainability of production (Aiello et al., 2020; Oztemel & Gursev, 2020; Zhang et al., 2022), as well as on the shortening of the supply chain, partially due to the current geopolitical events in the world (Anbesh et al., 2021; Zheng et al., 2021; Asokan et al., 2022; Herawati et al., 2021; Little & Sylvester, 2022; Mandičák et al., 2021).

Internet of Things and smart factories
One of the fundamental terms associated with the 4th industrial revolution is the IoT, which essentially represents a set of devices that use a wireless connection for mutual communication, read physical quantities with various sensors, and process, share and store information via the Internet. Nodes within the IoT system can be electronic or embedded devices, as well as physical objects, which communicate with each other and operate without the need for human intervention (Abu Khurma, Almomani & Aljarah, 2021).
In the context of smart factories, the improvement of product quality and productivity rates are recognized as the main benefits of the 4th industrial revolution, while the improvement of the quality of services is noticeable in healthcare. The general progress in the field of IT, such as affordable computing resources and the emergence of appropriate algorithms, has led to rapid development of technologies, especially AI and ML, which in turn evolved into base tools for solving difficult problems in various fields, including finance, industry, healthcare, human resources, software development and defect prediction, agriculture and logistics, to name a few (Buchanan, 2019; Peres et al., 2020; Yu, Beam & Kohane, 2018; Dobrojevic & Bacanin, 2022; AVSystem, 2020; Biliavska, Castanho & Vulevic, 2022; Zheng et al., 2022; Holliday, Sani & Willett, 2015; Muhammad, Abdullah & Samsiah Sani, 2021; Mohamed Nafuri et al., 2022; Abdul Rahman et al., 2021).
Healthcare 4.0 represents a critical field of research that is directly related to the development of IoT and ML technologies. During the ongoing COVID-19 pandemic, the technologies of the 4th industrial revolution led to the development of digital solutions that provided tools for efficient management of the supply chain and of human and material resources during the crisis, protected medical personnel, and provided means for remote work in a wide variety of industries (Javaid et al., 2020).
Besides direct improvement of conditions and quality of life, diagnostics and prevention play an increasingly important role in providing adequate and timely treatment, especially in diseases such as cancer and diabetes (Howell, 2010;Hopek & Siniak, 2020).
In scenarios when several medical conditions share similar symptoms, when symptoms appear only in the late stages of the disease, or when it is difficult to recognize symptoms and provide a diagnosis for whatever other reason (Gershon-Cohen, Berger & Curcio, 1966), timely implementation of adequate treatment is of crucial importance. Even nowadays, diagnosis in health systems relies heavily on the expertise and experience of doctors, in spite of their limited number and current geographical location. Technologies based on IoT, AI, and ML allow medical staff to work with patients remotely, and provide machines with a certain degree of autonomy in diagnosis, thereby reducing the diagnostic time, the risk of misdiagnosis, and, overall, the pressure on medical staff (Szolovits, 1988).
In real-world applications, currently available computing resources may often prove inadequate for the large amounts of data processed by AI and ML models, which in turn may affect overall system performance and reliability. In addition, increasingly frequent cyber-attacks on IoT systems call into question the credibility of service providers, and consequently threaten business operations and finances.
Cloud Computing (CC) is recognized as a technology capable of handling the large amounts of data and traffic generated by IoT systems, but it also introduces numerous risks, e.g., inconsistent performance, security risks, latency, and the possibility of network breakdowns (Sabireen & Neelanarayanan, 2021; Li & Geng, 2023). More recently, Fog Computing (FC) was introduced in order to deal with these issues as an intermediary between the IoT and CC. The key task of FC is to process the data generated by the nearby IoT devices. By performing the task locally at the fog node rather than relaying information to the cloud server, FC may deliver services with higher quality and better response time.
In addition, ML algorithms contain a large number of hyperparameters used to control the learning process, whose values often cannot be resolved in an optimal manner and thus essentially affect the speed and quality of the learning process. Traditionally, these parameters are determined through trial and error, an approach that may be suitable in simpler scenarios, but is inapplicable to more complex problems. The latter require optimization methods for hyperparameter determination, a topic that has recently come under the spotlight of researchers (Feurer & Hutter, 2019).
Operating systems, e.g., Windows and Linux, as well as IoT networks, have security vulnerabilities that can be exploited to give attackers access to the system and data, and thus IoT systems have been recognized as prime targets for large-scale cyber attacks. Communication between IoT devices can be intercepted and manipulated, and there is always the possibility of one or more devices in the system malfunctioning. Hacking tools are readily available and easy to use, without any specialized skills required to carry out a successful attack (Louvieris, Clewley & Liu, 2013). That is why the detection of failures and potential network intrusions in real time is one of the priorities for the secure and stable operation of the system (Stone-Gross et al., 2009).
AI, and especially ML and deep learning (DL) algorithms, can be used to overcome such challenges, as tools for error prediction, intrusion detection, diagnostics, etc. The effectiveness of the chosen tool in a given situation directly depends on the feature selection (FS), the values of hyperparameters and the model training conducted in order to achieve the desired result. Despite the use of advanced security tools such as firewalls, antivirus software, data encryption or biometric verification (Al-Jarrah et al., 2016), cyber attacks continue to affect organizations and businesses, and attackers exploit system vulnerabilities in order to gain access and perform attacks, e.g., theft of sensitive information.
In the last decade, a multitude of IT security solutions based on AI have been introduced, Intrusion Detection Systems (IDS) being one of them (Kareem et al., 2022). IDSs proved to be effective in protection of IoT systems (Ashraf et al., 2021; Zhou et al., 2020) because they can analyze network traffic, distinguish legitimate from malicious traffic, and determine the type of detected attack. There are two basic types of IDSs:
Host-based IDSs (HIDSs) are programs on the host machine, i.e., software agents, that autonomously monitor system calls and logs with the aim of detecting unauthorized activities.
Network-based IDSs are placed at key positions in the computer network in order to monitor the traffic.
Another obstacle for ML is the variety of attack types and network traffic features, which makes the problem solving more complex (Aljawarneh, Aldwairi & Yassein, 2018). Based on their mechanism, IDSs could be further classified into (Moustafa et al., 2020; Kareem et al., 2022):
Signature-based IDSs detect potentially illegal activities based on previously known patterns, i.e., signatures (Freeman et al., 2002; De La Hoz et al., 2015). Signature-based HIDS monitor the host by scanning network traffic, logs and memory dumps. Although fast and reliable, they cannot detect previously unknown attacks. They also require regular updating of attack pattern definitions, otherwise they will fail to provide detection even for minor changes in the pattern, which is a well known vulnerability often abused by attackers.
Anomaly-based IDSs compile a profile of the regular system behaviour, which is afterwards used for detection of harmful actions (Jose et al., 2018). This is usually achieved by applying ML and DL models that are trained on previously collected data. Therefore, they can detect new, previously unknown attacks, as well as mutations of known attacks, but with the tradeoff of increased processing power required compared to signature-based IDSs (Jose et al., 2018; Moustafa et al., 2019). Despite the high rates of false positives (FPs), this approach is useful in detection of innovative types of attack.

Research summary and contributions
The research presented within this article tries to further improve IoT security by verifying the performance of the hybrid CNN and ELM structure on the network attack classification problem, using the relatively novel TON_IoT Windows 7 and Windows 10 datasets (Moustafa et al., 2020), which are considered benchmarks for determining the efficiency of intrusion detection systems. An elementary lightweight CNN is used to perform the feature extraction, while the ELM is employed for classification of the features extracted by the CNN. CNNs have proven to be very successful in a variety of complex tasks. Besides image classification, they are capable of automatically discovering hidden patterns in data, which other models are not able to do (Sharma et al., 2021). Additionally, a lightweight CNN is easy and quick to train. On the other hand, other models have much better classification capabilities than the CNN's dense layers, and this fact directed the proposed research towards replacing the CNN's dense layers with a traditional ML model.
There are also other examples in the literature where CNNs perform feature extraction and other models are assigned to execute classification, for example the eXtreme gradient boosting (XGBoost) model (Thongsuwan et al., 2021; Khan et al., 2022; Niu et al., 2020), the long short term memory (LSTM) model and the support vector machine (SVM) (Sun et al., 2019). In this work, the ELM was chosen as it does not require classical training; instead it only requires initialization of weight and bias values. Therefore, the goal of the proposed research is to develop a hybrid ML/DL model that is as lightweight as possible to deal with this challenging security task. The extensive literature survey has also revealed that this particular CNN and ELM hybrid combination has never been utilized to address the intrusion detection problem.
However, since the ELM's performance depends to a large extent on the randomly initialized weights and biases and on the number of neurons in the single hidden layer, this research makes use of a modified version of the SCA for tuning the ELM for this particular issue, with the goal of further improving the classification performance of the model. Since the problem of determining the ELM's weights, biases and number of neurons for a specific problem falls into the category of mixed integer-continuous non-deterministic polynomial hard (NP-hard) optimization, the choice of employing metaheuristics in this case is logical because they have proved to be very efficient NP-hard problem solvers (Zivkovic et al., 2020). Finally, it is also worth pointing out that metaheuristics-based methods can always be enhanced, by modifications or hybridization with other approaches. According to the no free lunch theorem (Wolpert & Macready, 1997), an algorithm capable of obtaining the best outcomes for every optimization problem does not exist, and for each particular task a specific algorithm can be introduced.
In accordance with everything stated above, this article offers the following set of contributions:
An efficient and lightweight hybrid CNN-ELM model is developed to address IoT security challenges;
An enhanced version of SCA metaheuristics was developed to specifically target the known limitations exhibited by the elementary SCA variant;
The devised algorithm was utilized to discover adequate hyperparameter values and improve the ELM classification accuracy as a component of the framework designated for the intrusion detection classification;
The results attained by the proposed structure were compared to other noteworthy metaheuristics, used in the identical experimental framework to classify the network attacks.
The remainder of this manuscript is organized as follows. The next section introduces the basics of neural networks and ELM, together with the fundamentals of metaheuristics optimization. Afterwards, the elementary SCA is explained, together with its known flaws, and the modified SCA is proposed, together with the suggested classification framework. The next section brings forward the experimental setup, experimental outcomes, statistical analysis and model interpretation. Lastly, the final section summarizes the research, hints at future research possibilities and concludes the manuscript.

PRELIMINARIES AND RELATED WORKS
This section provides background related to the methods utilized in this research. First, a brief introduction to artificial neural networks is given, followed by the theoretical background of the utilized ELM model. Finally, a brief overview of metaheuristics optimization is given.

Artificial neural networks
Artificial neural networks (ANNs) are used to solve problems from different domains, which are difficult to solve or cannot be solved using traditional programming techniques. ANNs can provide quality results in (un)supervised machine learning tasks (Krogh, 2008). ANNs and their types, e.g., CNNs, recurrent neural networks (RNNs), etc., are extensively used in pattern recognition, classification of objects, and prediction. Various forms of ANN are used in traffic for road management (Olayode et al., 2021; Ren et al., 2022) and autonomous vehicle control (Zhang, Jing & Xu, 2021), in civil engineering to predict the fatigue of structural materials (Bai et al., 2021), in the military for quantum communication (Quach, 2021) and aerial swarms (Abdelkader et al., 2021), in agriculture for detection of plant diseases (Roy & Bhaduri, 2021) and assessment of soil suitability (Vincent et al., 2019), while in medicine they are used for diagnostics (Esteva et al., 2017), pandemic related applications (Adedotun, 2022) and classification of heart diseases and diabetes, to name a few.
The term ANN may refer to a hardware system or a software application with an architecture influenced by biological neural networks, such as those in the natural brain (Bhadeshia, 2008; Brahme, 2014). An ANN is based on a set of interconnected points referred to as nodes, emulating neurons, and each connection (edge, or synapse in nature) transmits a signal (a real number) from one neuron to another. The output of each neuron is determined by a non-linear function of the sum of its inputs. Neurons and edges usually have a weight factor that changes during the learning process, and affects the strength of the signal on the connection. Neurons can also have a threshold, passing the signal through only when the aggregate signal exceeds that threshold. Typically, neurons are arranged in layers (Bre, Gimenez & Fachinotti, 2018), and different layers can perform different transformations on input signals. The input signal travels from the input layer (the first layer) to the output layer (the last layer), and may pass through the intermediate layers (hidden layers) multiple times. Each neuron has a local memory in which it remembers the data it processes.
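The signal flow described above can be illustrated with a minimal NumPy sketch. The layer sizes, the sigmoid activation and the random weights are assumptions chosen only for illustration; they are not taken from the model used in this article.

```python
import numpy as np

def sigmoid(z):
    """Non-linear activation applied to the weighted sum of a neuron's inputs."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Propagate an input vector through a list of (weights, biases) layers."""
    signal = x
    for weights, biases in layers:
        # Each row of `weights` holds the edge weights feeding one neuron.
        signal = sigmoid(weights @ signal + biases)
    return signal

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), rng.normal(size=4)),  # hidden layer: 3 inputs -> 4 neurons
    (rng.normal(size=(2, 4)), rng.normal(size=2)),  # output layer: 4 -> 2 neurons
]
output = forward(np.array([0.5, -1.0, 2.0]), layers)
print(output.shape)  # (2,)
```

Training would adjust the entries of `weights` and `biases`; here they stay fixed, so the sketch only shows how a signal moves from the input layer to the output layer.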
A feedforward neural network (FNN) is an ANN in which information always propagates forward along the connections between nodes (Zell, 2003). The single-layer perceptron (SLP) is the simplest form of ANN, having only two layers, input and output, but unfortunately it is not capable of efficiently processing nonlinearly separable patterns (Hu, 2014; Ojha, Abraham & Snášel, 2017). Multilayer perceptrons (MLPs) overcome these shortcomings by having one or more hidden layers, making them the most popular form of ANN at the moment. Some of the important advantages they possess are robustness, learning capacity, parallel processing and capacity to generalize (Faris et al., 2016).
In common speech the process of "capturing" the unknown information is called "learning" or "training" of an ANN. In the mathematical context, however, to "learn" refers to adjustment of the weight coefficients in order to satisfy the predefined conditions (Svozil, Kvasnicka & Pospichal, 1997). The training of the ANN directly affects the quality of the model, which makes the optimization of the loss function during the learning process necessary (Duchi, Hazan & Singer, 2011; Zeiler, 2012; Kingma & Ba, 2015; Cheng et al., 2022). In general, training processes can be classified as:
Supervised training. The ANN 'knows' the desired output, so the weight coefficients are adjusted in such a manner that the calculated and desired outputs are as close as possible.
Unsupervised training. The desired output is not known; the system is provided with a group of facts and then left alone to autotune towards a stable state in a limited number of iterations.
During this process, over-fitting can occur, i.e., significant deviations between training and test accuracy, indicating that the network has learned specific data and cannot properly process data outside that range. This problem can be treated by regularization, and some of the suggested approaches are batch normalization, data augmentation, dropout, drop connect, early stopping, L1/L2 regularization, etc.
In DL, CNN is a class of ANN. The CNNs mimic the pattern of connectivity between neurons of a biological visual cortex (Fukushima, 1980;Matsugu et al., 2003), making them an excellent tool for feature extraction, especially suitable in the field of CV. CNNs use relatively light pre-processing because the network "learns" to optimize filters or kernels through ML, compared to traditional algorithms where these filters must be set manually. That very independence from prior knowledge and human intervention in feature extraction is a major advantage.

Extreme learning machines
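The ELM represents an ML approach applied to single-layer FNNs. This approach randomly initializes the hidden neurons in the network, followed by processing stages that determine the output weights via the Moore-Penrose generalized inverse. The application of hidden layers and non-linear transformations converts input values into the ELM feature space of higher dimension, which simplifies the original problem.
For a training set $\aleph = \{(x_i, t_i) \mid x_i \in \mathbb{R}^d,\ t_i \in \mathbb{R}^m,\ i = 1, \ldots, N\}$, a network with $L$ hidden neurons and an activation function $g(x)$ produces the outputs shown in Eq. (1).
$$\sum_{i=1}^{L} \beta_i\, g(w_i \cdot x_j + b_i) = y_j, \quad j = 1, \ldots, N \tag{1}$$
where $w_i$ and $b_i$ denote the input weights and the bias of the $i$-th hidden neuron, and $\beta_i$ denotes its output weights.
Furthermore, the parameter $T$ may be calculated with the use of Eq. (3):
$$T = H\beta \tag{3}$$
where $H$ is the hidden layer output matrix shown in Eq. (4).
$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_L \cdot x_N + b_L) \end{bmatrix}_{N \times L} \tag{4}$$
The output weights $\beta$ are determined using the minimum norm least-square solution:
$$\beta = H^{\dagger} T$$
where $H^{\dagger}$ represents the generalized Moore-Penrose inverse of $H$. The random initialization of the weight and bias values strongly affects the classifier performance; choosing these values well represents an NP-hard challenge, and they need to be optimized for each particular classification problem.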
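The ELM procedure described above (random hidden layer, output weights from the Moore-Penrose inverse) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the `tanh` activation, the uniform initialization range, the toy two-class data and the function names `elm_fit` and `elm_predict` are all hypothetical choices, not the configuration used in the experiments.

```python
import numpy as np

def elm_fit(X, T, n_hidden, rng):
    """Draw random input weights and biases, then solve for the output weights."""
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # input weights (assumed range)
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # hidden biases (assumed range)
    H = np.tanh(X @ W + b)            # hidden layer output matrix, shape N x L
    beta = np.linalg.pinv(H) @ T      # minimum-norm least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Map inputs through the fixed hidden layer and the learned output weights."""
    return np.tanh(X @ W + b) @ beta

# Toy two-class problem (assumption): two well separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
labels = np.repeat([0, 1], 50)
T = np.eye(2)[labels]                 # one-hot target matrix

W, b, beta = elm_fit(X, T, n_hidden=20, rng=rng)
accuracy = np.mean(elm_predict(X, W, b, beta).argmax(axis=1) == labels)
print(f"training accuracy: {accuracy:.2f}")
```

Because `W` and `b` are never updated after being drawn, the quality of the fit depends on that draw, which is why tuning them with a metaheuristic is a natural extension.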

Metaheuristics optimization
Metaheuristics optimization is a field of AI modelled after examples often found in social groups in nature (Hu et al., 2021), and capable of solving complex real-world problems. Swarm intelligence is one of the most prominent groups of metaheuristics, where algorithms mimic the behavior of an individual in a group, e.g., in a colony, flock or herd, in order to solve the targeted problem.
An important feature of metaheuristics algorithms is their ability to deal with complex tasks using both limited computing resources and limited time frames, which cannot be achieved with a traditional mathematical approach. A single execution of the algorithm cannot guarantee the desired results due to the inherent randomness, but each subsequent execution increases the chances of finding the true optimum, and thus such algorithms must be executed several times. Although each algorithm may possess unique properties, the basic components enabling such algorithms to solve NP-hard problems are:
Exploration. The algorithm covers large areas within the search space, looking for sub-areas potentially containing better solutions.
Exploitation. The algorithm focuses on certain sub-areas, locating the best solution.
Desired results can be achieved only if an adequate balance between exploration and exploitation, suited to the specific problem, is reached. Most algorithms in this area use search agents tuned to work under simple sets of rules, allowing complex behavior to manifest globally.
Inspiration can be drawn from abstract ideas as well. The SCA has its origins in trigonometry (Mirjalili, 2016), the arithmetic optimization algorithm uses simple mathematical formulations (Abualigah et al., 2021), and another search algorithm models user behavior on social networks (Liang et al., 2006).

DEVELOPED METHOD AND PROPOSED FRAMEWORK
This section first introduces the basics of the original SCA metaheuristics, followed by its observed deficiencies and the details of the proposed improved approach. Finally, the section concludes with the introduced hybrid ML framework used for classification and the solution encoding scheme employed by the developed metaheuristics.

Basic sine cosine algorithm
The mathematical model of the SCA is inspired by the trigonometric functions (Mirjalili, 2016). The position updating is conducted according to the sine and cosine functions, which makes the solutions prone to oscillations in the region of the optimum; the return values of these functions fall into the $[-1, 1]$ range. During the initialization phase, the algorithm generates multiple solutions as candidates for the best solution given the constraints of the search area, and randomized adaptive parameters control the exploration and exploitation phases, as shown in Fig. 1 and the pseudo-code in Algorithm 1.
The position updating is performed as follows (Mirjalili, 2016):
$$X_{ij}^{t+1} = X_{ij}^{t} + r_1 \cdot \sin(r_2) \cdot \left| r_3 P_{ij}^{*} - X_{ij}^{t} \right| \tag{7}$$
$$X_{ij}^{t+1} = X_{ij}^{t} + r_1 \cdot \cos(r_2) \cdot \left| r_3 P_{ij}^{*} - X_{ij}^{t} \right| \tag{8}$$
where:
$i$ and $j$ index the solution and its dimension, $t$ is the current iteration and $t+1$ is the next iteration
$X_{ij}^{t}$ and $X_{ij}^{t+1}$ denote the position of a given solution in terms of dimension and iteration
$r_1$, $r_2$ and $r_3$ are generated pseudo-random numbers
$P_{ij}^{*}$ is the position of the target
$|\ldots|$ represents the absolute value
The equations are combined with the use of the control parameter $r_4$:
$$X_{ij}^{t+1} = \begin{cases} X_{ij}^{t} + r_1 \cdot \sin(r_2) \cdot \left| r_3 P_{ij}^{*} - X_{ij}^{t} \right|, & r_4 < 0.5 \\ X_{ij}^{t} + r_1 \cdot \cos(r_2) \cdot \left| r_3 P_{ij}^{*} - X_{ij}^{t} \right|, & r_4 \geq 0.5 \end{cases} \tag{9}$$
where $r_4$ denotes a randomly selected number from the $[0, 1]$ interval. The cyclic sequences produced by the sine and cosine functions allow for repositioning near the solution. In order to enhance exploration and the quality of randomness, the range for the parameter $r_2$ is set to $[0, 2\pi]$. The following equation is used to control the diversification and provide the exploitation balance:
$$r_1 = a - t\,\frac{a}{T} \tag{10}$$
where:
$t$ is the current repetition
$T$ denotes the maximum allowed number of repetitions per run
$a$ is a hardcoded, empirically determined value set to 2.0 (as suggested in Mirjalili (2016)), not adjustable by the user
Algorithm 1 The pseudo-code of the SCA (Gabis et al., 2021).
Initialize randomly a set of solutions $X_i$ ($i = 1, 2, \ldots, n$)
Repeat until the maximum number of iterations $T$ is reached:
  Calculate the objective value for each solution
  Update the destination ($P = X^{*}$)
  Update the random parameters $r_1$, $r_2$, $r_3$ and $r_4$
  Update the solutions using Eq. (9)
The SCA metaheuristic provides impressive performance on bound-constrained and unconstrained benchmarks, with relative simplicity and a small number of control parameters (Mirjalili, 2016). However, when tested on standard Congress on Evolutionary Computation (CEC) benchmarks, the algorithm tends to converge too fast towards the current best solutions, with reduced diversity of the population. Due to the search being directed towards $P^{*}$, if the initial results are too far from the optimum, the population will quickly converge towards a disadvantageous domain of the search space, with unsatisfactory final results.
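The basic SCA loop described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `sca`, the population size, the iteration budget, the clipping to the bounds and the $[0, 2]$ range for $r_3$ (the range commonly used with SCA, but not stated in this section) are assumptions.

```python
import numpy as np

def sca(objective, lb, ub, n=30, T=200, a=2.0, seed=0):
    """Minimize `objective` with a basic sine cosine search (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, dtype=float), np.asarray(ub, dtype=float)
    X = lb + (ub - lb) * rng.random((n, lb.size))       # random initial solutions
    fitness = np.apply_along_axis(objective, 1, X)
    best, best_fit = X[fitness.argmin()].copy(), fitness.min()
    for t in range(T):
        r1 = a - t * a / T                              # shrinks linearly from a to 0
        r2 = rng.uniform(0.0, 2.0 * np.pi, X.shape)
        r3 = rng.uniform(0.0, 2.0, X.shape)             # assumed range for r3
        r4 = rng.random(X.shape)
        dist = np.abs(r3 * best - X)                    # |r3 * P - X|
        X = np.where(r4 < 0.5,
                     X + r1 * np.sin(r2) * dist,        # sine branch
                     X + r1 * np.cos(r2) * dist)        # cosine branch
        X = np.clip(X, lb, ub)                          # keep solutions inside the bounds
        fitness = np.apply_along_axis(objective, 1, X)
        if fitness.min() < best_fit:                    # update the destination P
            best, best_fit = X[fitness.argmin()].copy(), fitness.min()
    return best, best_fit

def sphere(x):
    """Toy objective (assumption): sum of squares with its minimum at the origin."""
    return float(np.sum(x ** 2))

best, best_fit = sca(sphere, [-10.0] * 5, [10.0] * 5)
print(best_fit)
```

Note that every solution is pulled relative to the single destination `best`, which is the mechanism behind the premature convergence discussed above.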

Enhanced sine cosine algorithm
To address the known drawbacks of the basic algorithm, an enhanced version of SCA is proposed in this article, based on two procedures incorporated into the original metaheuristics: 1) chaotic initialization of the solutions forming the initial population, and 2) a self-adaptive search mechanism that switches the search process between the classic SCA search and the firefly algorithm (FA) search procedure.
The first proposed alteration of the basic version of SCA is chaotic initialization of the starting population. This approach aims to produce the starting set of solutions near the optimum region of the search space. It was proposed by Caponetto et al. (2003), who embedded chaotic maps inside metaheuristics algorithms to improve the search phase. Other relevant studies, including Wang & Chen (2020), Liu et al. (2021) and Kose (2018), have shown that the efficiency of the search procedure is greater if it relies on chaotic sequences rather than pseudo-random generators.
There are numerous chaotic maps that can be used; however, empirical experiments executed with SCA metaheuristics have shown that the logistic map yields the most promising results. Consequently, at the beginning of its execution the modified SCA utilizes the chaotic sequence $\beta$, starting from the pseudo-random value $\beta_0$ and produced by the logistic mapping, as given by Eq. (11).
$$\beta_{i+1} = \mu \beta_i (1 - \beta_i), \quad i = 0, 1, \ldots, N - 1 \tag{11}$$
where $N$ and $\mu$ represent the size of the population and the chaotic control value. The parameter $\mu$ has been set to 4, while applying the following limitations to $\beta_0$: $0 < \beta_0 < 1$ and $\beta_0 \neq 0.25, 0.5, 0.75, 1$. Solution $i$ is subjected to mapping according to the produced chaotic sequences for each component $j$ with respect to the following equation:
$$X_{ij}^{c} = \beta_i X_{ij} \tag{12}$$
where the novel location of individual $i$ after chaotic disturbances is denoted by $X_i^{c}$. The entire chaotic-based generation of the initial population is given in Algorithm 2. It is important to note that the introduced initialization procedure does not affect the algorithm's complexity with respect to the fitness function evaluations (FFEs), as it produces only $N/2$ arbitrary individuals, and afterwards maps those solutions to the corresponding chaotic solutions.
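A minimal sketch of the chaotic initialization is given below. It assumes that the logistic map takes the form $\beta_{i+1} = \mu \beta_i (1 - \beta_i)$ and that each random solution is mapped by multiplying it with its chaotic value; the function names, the starting value `beta0 = 0.7`, the clipping to the bounds and the toy objective are hypothetical choices made for illustration.

```python
import numpy as np

def logistic_sequence(beta0, length, mu=4.0):
    """Generate `length` values of the logistic map, starting after beta0."""
    seq = np.empty(length)
    beta = beta0
    for i in range(length):
        beta = mu * beta * (1.0 - beta)
        seq[i] = beta
    return seq

def chaotic_init(objective, lb, ub, n, beta0=0.7, seed=0):
    """Build N/2 random and N/2 chaotic solutions, merged and sorted by fitness."""
    rng = np.random.default_rng(seed)
    half = n // 2
    pop = lb + (ub - lb) * rng.random((half, lb.size))   # conventional initialization
    beta = logistic_sequence(beta0, half)
    pop_c = np.clip(beta[:, None] * pop, lb, ub)         # chaotic counterparts
    merged = np.vstack([pop, pop_c])
    fitness = np.apply_along_axis(objective, 1, merged)  # N evaluations in total
    order = np.argsort(fitness)                          # ascending: best first
    return merged[order], fitness[order]

def sphere(x):
    """Toy objective (assumption): sum of squares."""
    return float(np.sum(x ** 2))

lb, ub = np.full(3, -5.0), np.full(3, 5.0)
population, fitness = chaotic_init(sphere, lb, ub, n=10)
current_best = population[0]
```

Only `n` fitness evaluations are spent in total, which matches the remark that the scheme adds no cost in terms of FFEs compared with initializing `n` random solutions.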
The second alteration of the basic algorithm, the self-adaptive search strategy, is responsible for alternating the search procedure between the conventional SCA search and the FA's (Yang & Slowik, 2020) search procedure given by Eq. (13).
$$X_i^{t+1} = X_i^{t} + \beta_0 e^{-\gamma r_{i,j}^{2}} \left( X_j^{t} - X_i^{t} \right) + \alpha^{t} (\kappa - 0.5) \tag{13}$$
where $\alpha$ denotes the randomization parameter, while $\kappa$ marks the random value drawn from the Gaussian distribution. The distance between two individuals $i$ and $j$ is represented by $r_{i,j}$, while $\beta_0$ and $\gamma$ are the standard FA attractiveness at zero distance and light absorption coefficient. To further improve the FA's exploration and exploitation capabilities, this research utilizes dynamic $\alpha$, as shown in Yang & Slowik (2020). The proposed SCA algorithm switches between the SCA and FA search procedures on the level of every component $j$ of every individual $i$ as follows: if the produced pseudo-random value in the range $[0, 1]$ is smaller than the search mode ($sm$), the $j$-th component of individual $i$ is updated by applying the FA search (Eq. (13)); otherwise, the conventional SCA search takes place (Eq. (9)). The search mode control parameter $sm$ controls the balance between the SCA and FA search mechanisms, focusing more on the FA search to update the individuals at the beginning. As the iterations go by, assuming that the search space has been explored sufficiently, the SCA search is activated more often. This is achieved by dynamically reducing the value of the $sm$ parameter from its starting value in every iteration $t$. The starting value of the $sm$ parameter has been established empirically, and set to 0.8 in all simulations executed in this research.
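The component-wise switching between the FA and SCA searches can be sketched as below. This is only an illustration under several assumptions that are not taken from the text: the FA move uses the standard attractiveness form with hypothetical `beta0` and `gamma` values, a fixed `alpha` replaces the dynamic one, the attracting firefly is picked at random among the better individuals, and $r_3$ is drawn from $[0, 2]$.

```python
import numpy as np

def hasca_step(X, fitness, best, t, T, sm, rng,
               a=2.0, alpha=0.5, beta0=1.0, gamma=1.0):
    """One population update mixing FA and SCA moves per component (sketch)."""
    n, d = X.shape
    r1 = a - t * a / T
    X_new = X.copy()
    for i in range(n):
        # FA candidate: attraction towards a randomly chosen better individual
        # (assumption); the best individual only receives the random term.
        brighter = np.flatnonzero(fitness < fitness[i])
        j = rng.choice(brighter) if brighter.size else i
        r_ij = np.linalg.norm(X[i] - X[j])
        fa = (X[i] + beta0 * np.exp(-gamma * r_ij ** 2) * (X[j] - X[i])
              + alpha * (rng.normal(size=d) - 0.5))      # kappa drawn from a Gaussian
        # SCA candidate: sine or cosine branch chosen by r4.
        r2 = rng.uniform(0.0, 2.0 * np.pi, d)
        r3 = rng.uniform(0.0, 2.0, d)                    # assumed range for r3
        r4 = rng.random(d)
        trig = np.where(r4 < 0.5, np.sin(r2), np.cos(r2))
        sca = X[i] + r1 * trig * np.abs(r3 * best - X[i])
        # Per-component switch: FA when the random draw is below sm, else SCA.
        use_fa = rng.random(d) < sm
        X_new[i] = np.where(use_fa, fa, sca)
    return X_new

rng = np.random.default_rng(1)
X = rng.uniform(-5.0, 5.0, size=(6, 4))
fitness = (X ** 2).sum(axis=1)                           # toy objective (assumption)
best = X[fitness.argmin()]
moved = hasca_step(X, fitness, best, t=0, T=10, sm=0.8, rng=rng)
frozen = hasca_step(X, fitness, best, t=10, T=10, sm=0.0, rng=rng)
```

With `sm=0.0` only the SCA branch is used, and at `t == T` the factor `r1` is zero, so `frozen` equals `X`. A caller would lower `sm` between iterations; a linear schedule such as `sm = 0.8 * (1 - t / T)` is one assumed option, not necessarily the rule used in this research.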
Algorithm 2 Pseudo-code of the chaotic-based initialization scheme.
Step 1: Produce population Pop of N=2 individuals by applying conventional initialization mechanism: X i ¼ LB þ ðUB À LBÞ Á randð0; 1Þ; i ¼ 1; …N, where randð0; 1Þ is the pseudo-random value within ½0; 1 and LB and UB are arrays with lower and upper boundaries of every individual's component j, respectively.
Step 2: Generate the chaotic population Pop c of N=2 solutions by mapping the individuals belonging to Pop to chaotic sequences by utilizing Eqs. (11) and (12).
Step 3: Integrate Pop and Pop c (Pop [ Pop c ) and sort merged set of size N with respect to fitness value in ascending order.
Step 4: Ascertain the current best individual P. Finally, the introduced enhanced SCA method is actually a low-level hybrid since the FA search mechanism has been incorporated into the SCA method. It was named hybrid adaptive SCA (HASCA), and the pseudo-code that shows the inner workings of this method is given by Algorithm 3.
Finally, it can be noted that HASCA does not add overhead to the basic SCA: the complexity with respect to fitness function evaluations (FFEs) of both methods, basic and enhanced, is $O(N) = N \cdot N \cdot T$.
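The four steps of Algorithm 2 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: a single logistic-map step (with the hypothetical parameter `mu = 4`) stands in for Eqs. (11) and (12), which are not restated here, and the fitness is assumed to be minimized so that the first individual after the ascending sort is the best one. The name `chaotic_init` is hypothetical.

```python
import numpy as np

def chaotic_init(n, lb, ub, fitness, rng, mu=4.0):
    """Half random, half chaotic population, merged and sorted by fitness."""
    lb, ub = np.asarray(lb, dtype=float), np.asarray(ub, dtype=float)
    half = n // 2
    # Step 1: conventional uniform initialization of N/2 individuals
    pop = lb + (ub - lb) * rng.random((half, lb.size))
    # Step 2: chaotic counterparts (logistic map is an assumed stand-in for Eqs. (11)-(12))
    c = (pop - lb) / (ub - lb)   # normalise every component to [0, 1]
    c = mu * c * (1.0 - c)       # one logistic-map step, stays inside [0, 1] for mu = 4
    pop_c = lb + (ub - lb) * c
    # Step 3: merge both halves and sort by fitness in ascending order
    merged = np.vstack([pop, pop_c])
    scores = np.array([fitness(x) for x in merged])
    order = np.argsort(scores)
    merged, scores = merged[order], scores[order]
    # Step 4: current best individual (first one, assuming lower fitness is better)
    return merged, scores, merged[0]

population, scores, best = chaotic_init(
    10, [-1.0, -1.0], [1.0, 1.0],
    fitness=lambda x: float(np.sum(x ** 2)),  # toy objective used only for the demo
    rng=np.random.default_rng(1),
)
```

Only N/2 individuals are sampled randomly; the other half is derived from them, so the merged population has size N.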

Proposed classification framework
The suggested classification framework is a hybrid CNN and ELM structure. The CNN performs the task of feature extraction, where the outputs are collected from the second to last dense layer (the one before the final layer). This collection of features is then passed to the inputs of the ELM model, which performs the classification. The ELM hyperparameters were optimized by the proposed HASCA algorithm. The flowchart of this classification framework is shown in Fig. 2.
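To make the ELM stage concrete, the following is a minimal sketch of an ELM classifier operating on already extracted features. It rests on several assumptions not stated in the text: a sigmoid activation, output weights obtained with the Moore-Penrose pseudo-inverse (the usual ELM training rule), and synthetic two-class data standing in for CNN features. In the framework above the hidden weights and biases would be supplied by HASCA; here they are simply drawn at random, and all names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(features, targets_onehot, weights, biases):
    """Compute ELM output weights for fixed hidden-layer weights and biases."""
    hidden = sigmoid(features @ weights + biases)
    # Least-squares solution for the output layer via the pseudo-inverse
    return np.linalg.pinv(hidden) @ targets_onehot

def elm_predict(features, weights, biases, beta):
    hidden = sigmoid(features @ weights + biases)
    return np.argmax(hidden @ beta, axis=1)

rng = np.random.default_rng(0)
# Synthetic stand-in for CNN features: two well separated clusters, 4 features each
features = np.vstack([rng.normal(-3.0, 0.5, (20, 4)), rng.normal(3.0, 0.5, (20, 4))])
labels = np.repeat([0, 1], 20)
targets = np.eye(2)[labels]

nn = 50                                  # hidden neurons (kept small for the demo)
weights = rng.uniform(-1.0, 1.0, (4, nn))  # input-to-hidden weights in [-1, 1]
biases = rng.uniform(-1.0, 1.0, nn)
beta = elm_fit(features, targets, weights, biases)
predictions = elm_predict(features, weights, biases, beta)
accuracy = float(np.mean(predictions == labels))
```

Because only the output weights are solved in closed form, the quality of the model depends heavily on the hidden weights and biases, which is what motivates tuning them with a metaheuristic.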
The suggested framework utilizes an empirically established lightweight CNN structure, designed to be as elementary as possible to permit effortless training and fast execution. The proposed CNN model comprises a single convolutional layer (64 filters, kernel size 6, with the relu activation function), batch normalization, a max pooling layer with a pool size of three, and a pair of dense layers. Since the experiment was a multiclass classification problem, categorical crossentropy was chosen as the loss function, and the adam optimizer was selected with the default learning rate of $lr = 0.001$. Accuracy was used to measure the performance level of the model. Lastly, this structure was trained with a batch size of 16 for 10 epochs. The CNN model is presented in Fig. 3.

Solutions encoding scheme
The tuning procedure of the ELM structure in this research consisted of optimizing the number of neurons (nn) in the hidden layer, together with the weights and biases connecting the input and hidden layers. The bounds for weights and biases were set to $[-1, 1]$, while the limits for nn were set to $[300, 600]$. Here, nn is represented as an integer value, as opposed to the weights and biases, which are continuous inside the provided ranges. The optimization of nn corresponds to the ELM hyperparameter tuning, while the optimization of the weights and biases is related to the training procedure of the ELM model. The range $[300, 600]$ for nn has been established empirically, aiming to create a network that is neither too simple nor too complex, to avoid the overfitting issue. The proposed HASCA algorithm was used to address both described tasks. Every individual in the population was encoded by utilizing the regular flat encoding scheme. In other words, every solution is structured as a vector of length l, denoting the count of parameters that were tuned. As l depends on the count of neurons nn and the length of the input feature vector fs, it can be obtained as follows: $l = 1 + nn \cdot fs + nn$. More precisely, the first element of every individual marks the count of neurons (integer), followed by nn elements denoting the biases (continuous), and lastly, the final $nn \cdot fs$ elements denote the weights (continuous).
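The encoding can be sketched as a small decoding routine. This is an illustration with assumptions: the vector is allocated for the largest allowed nn and the unused tail is ignored when a smaller nn is decoded, the first element is rounded and clipped to the integer range, and the weight block is reshaped to (fs, nn). None of these details are confirmed by the text, and the function names are hypothetical.

```python
import numpy as np

NN_MIN, NN_MAX = 300, 600  # bounds for the number of hidden neurons

def solution_length(nn, fs):
    """Length l = 1 + nn * fs + nn of a flat solution vector."""
    return 1 + nn * fs + nn

def decode(solution, fs):
    """Split a flat solution into neuron count, biases and input weights."""
    nn = int(np.clip(np.rint(solution[0]), NN_MIN, NN_MAX))
    biases = solution[1:1 + nn]
    weights = solution[1 + nn:1 + nn + nn * fs].reshape(fs, nn)
    return nn, biases, weights

fs = 10  # hypothetical number of extracted features
rng = np.random.default_rng(0)
# Allocate for the largest network; weights and biases lie in [-1, 1]
solution = rng.uniform(-1.0, 1.0, solution_length(NN_MAX, fs))
solution[0] = 450.3  # continuous value that is rounded to an integer neuron count
nn, biases, weights = decode(solution, fs)
```

Treating the neuron count as a rounded continuous variable lets a single continuous optimizer handle both the integer hyperparameter and the real-valued weights.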

EXPERIMENTAL FINDINGS AND DISCUSSION
This section first introduces the datasets utilized throughout the experiments. Afterwards, the experimental setup is briefly explained, followed by the experimental findings and discussion of outcomes. Finally, this section brings forward the statistical tests and validation of the model, together with the SHAP analysis of the most significant features.

TON_IoT database
In order to properly evaluate the usability of IDS and AI-based security solutions, testing on proper datasets gathered in real-world conditions is a must. Various datasets have been proposed in the literature for this purpose (Koroniotis et al., 2019; Moustafa & Slay, 2015), such as DARPA 98 and KDD-99 (Lippmann et al., 2000), which are now considered outdated because the implemented attack scenarios date back to 1998, followed by ADFA-LD (Creech & Hu, 2013) and NGIDS-DS (Haider et al., 2017). The mentioned datasets were generated on the Linux operating system (OS), while the SSENet-2014, AWSCTD and ADFA-WD datasets are suggested for Windows OS machines (Moustafa et al., 2020). TON_IoT belongs to the new generation of Industry 4.0 databases. It was created with the aim of compensating for the perceived shortcomings of existing datasets, such as the lack of data on the behavior of memory, hard drives and processors, as well as data related to IoT. It includes federated data sources collected from IoT service telemetry datasets, Windows and Linux OS datasets, and network traffic datasets. The OS datasets are collected from memory, processor, network, process and hard disk audit trails (Fig. 4). Such datasets can be used to test AI-based cyber security solutions, including IDSs, threat intelligence and hunting, privacy protection and digital forensics.

Datasets
In the experiment, the TON_IoT datasets for Windows 7 and Windows 10 OS were used, containing 133 and 125 features, respectively, and significant amounts of data (28,367 and 35,975 records) generated using virtual machines with the appropriate OSs and Windows performance tracking tools, as described in Tableau (2020). Each training set contains 10,000 records from regular traffic, as well as data classified as attacks: 5,980 records for Windows 7 and 11,104 for Windows 10, as shown in Figs. 5 (binary distribution) and 6 (multiclass distribution). These sets can furthermore be divided into subsets in the 70%/30% ratio, for training and testing of the AI model (Moustafa et al., 2020).
Correlation analysis shows the importance of features and their use value. A custom correlation function can be used to determine the correlation coefficient between features without a label and to rank the features by strength in the range [−1, 1]. The sign of the correlation coefficient indicates the direction of the connection, while the coefficient itself indicates the strength of the connection between two features (Koroniotis et al., 2019). The correlation matrices in Figs. 7 and 8 proved useful for representing the most closely related features, listed in Tables 1 and 2, which are then used to train and validate the effectiveness of the ML model in classifying attacks from the dataset. From the presented correlation matrices, it can be noticed that in both datasets a high positive correlation exists between a pair of features. In the case of the Win 7 benchmark, there is a 99% correlation between the "IO data bytes sec" and "Process (_Total) IO Read Bytes sec" attributes, while in the case of the Win 10 dataset, there is a correlation of 93% between the "Disk ready bytes sec" and "Memory page reads sec" features. This is logical, because each pair of features in both datasets refers to the input/output (IO) read performance of the system. At first glance, these may seem to be redundant (derived) features. However, since the CNN is used for feature extraction in this research, those features were used as input to the model along with the other attributes.
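A ranking of strongly correlated feature pairs of the kind discussed above can be sketched as follows. The data is synthetic and the column names (`io_total`, `io_read`, `cpu_load`) are hypothetical stand-ins, not TON_IoT attributes; the Pearson coefficient and the 0.9 threshold are likewise assumptions, since the custom correlation function is not specified in the text.

```python
import numpy as np

def strongly_correlated_pairs(data, names, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation reaches the threshold."""
    corr = np.corrcoef(data, rowvar=False)  # columns are features
    pairs = []
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            if abs(corr[a, b]) >= threshold:
                pairs.append((names[a], names[b], float(corr[a, b])))
    # Strongest relationships first
    return sorted(pairs, key=lambda p: -abs(p[2]))

rng = np.random.default_rng(0)
io_total = rng.random(200)
io_read = 0.9 * io_total + 0.01 * rng.normal(size=200)  # nearly derived from io_total
cpu_load = rng.random(200)                               # unrelated feature
data = np.column_stack([io_total, io_read, cpu_load])
names = ["io_total", "io_read", "cpu_load"]
pairs = strongly_correlated_pairs(data, names, threshold=0.9)
```

A pair flagged this way is a candidate for removal in classical pipelines, although, as noted above, both features can be kept when a CNN performs the feature extraction.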

Metrics
The model introduced in this research has been validated according to the conventional machine learning metrics, which rely on true negative (TN), true positive (TP), false negative (FN) and false positive (FP) counts. These metrics allow obtaining crucial key performance indicators, including accuracy, precision, recall, and F-score. Moreover, this research utilizes Cohen's kappa coefficient $\kappa$ (Cohen, 1960), which was used as the objective function to be maximized. Cohen's kappa coefficient is a statistical metric that can be utilized to assess inter-rater reliability (McHugh, 2012). It can also be used to estimate the performance level of a given classifier. Cohen's kappa value is calculated from the confusion matrices utilized by machine learning models to evaluate both binary and multiclass classifications. In contrast to the overall accuracy of the classifier, which can be misleading for imbalanced datasets, Cohen's kappa takes into consideration the imbalance within the class distribution, therefore providing more reliable findings. Cohen's kappa is calculated according to Eq. (15):
$$\kappa = \frac{p_o - p_e}{1 - p_e} \quad (15)$$
where $p_o$ represents the observed values (observed agreement), and $p_e$ denotes the expected values (agreement expected by chance).
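The calculation of Cohen's kappa from a confusion matrix can be sketched as below. The two confusion matrices are invented for illustration only and do not come from the experiments; the function name is hypothetical.

```python
import numpy as np

def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    p_o = np.trace(confusion) / total  # observed agreement (equals accuracy)
    # Chance agreement from the row and column marginals
    p_e = (confusion.sum(axis=1) * confusion.sum(axis=0)).sum() / total ** 2
    return (p_o - p_e) / (1.0 - p_e)

# Balanced toy example: accuracy 0.85, chance agreement 0.5
kappa_balanced = cohen_kappa([[45, 5], [10, 40]])

# Imbalanced toy example: always predicting the majority class gives
# 90% accuracy, yet no agreement beyond chance
kappa_majority = cohen_kappa([[90, 0], [10, 0]])
```

The second case shows why kappa is a safer objective than accuracy on imbalanced data: a classifier that ignores the minority class scores zero.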

Experimental setup
As previously noted, the ELM model requires tuning for every individual classification task. The developed HASCA algorithm was used to drive the tuning procedure. The outcomes of the ELM optimized by HASCA were compared to the scores acquired by seven additional powerful metaheuristics, implemented independently for the sake of this research and deployed in the same framework as HASCA to optimize the ELM model. The chosen contending algorithms were the basic SCA, artificial bee colony (ABC) (Karaboga & Basturk, 2008), bat algorithm (BA) (Yang & Gandomi, 2012), whale optimization algorithm (WOA) (Mirjalili & Lewis, 2016), elephant herding optimization (EHO) (Wang, Deb & Coelho, 2015), chimp optimization algorithm (ChOA) (Khishe & Mosavi, 2020) and reptile search algorithm (RSA) (Abualigah et al., 2022). Every competitor metaheuristic has been implemented with the native control parameter values described by its authors. Each of the algorithms was executed with 15 individuals in the population, 15 rounds in each run, and 15 independent executions. In order to simplify the interpretation of the experimental outcomes, the prefix CNN-ELM was prepended to the name of each metaheuristic (CNN-ELM-

Experimental results and discussion
This section presents the simulation outcomes over both considered datasets and discusses the attained experimental results. For all tables that contain the experimental results, the best scores in every considered category are accentuated in bold. Table 3 contains the determined count of neurons nn in ELM. Since the range of search for nn was set to $[300, 600]$, it is clear that almost all models converged to the upper limit. However, since the performance of the    HASCA model once more outclassed every other contender with respect to the best, worst, mean and median values, leaving behind the CNN-ELM-ChOA and CNN-ELM-WOA approaches. Concerning the standard deviation and variance values, CNN-ELM-SCA again attained the best results, providing the most stable results over the runs. Table 5 brings forward the comprehensive metrics achieved by the best run of every regarded model. It is worth noting that the suggested CNN-ELM-HASCA model attained the highest accuracy of 98.67%, while CNN-ELM-ChOA and CNN-ELM-WOA acquired the second best accuracy of 98.63%. The suggested CNN-ELM-HASCA method also outperformed the contenders in other statistical categories, as it attained the best results for the majority of the other indicators.

Windows 7 dataset experimental results
Visualisations of the experimental outcomes attained on the Windows 7 dataset are provided in Figs. 9 and 10. Figure 9 shows the convergence graphs, box plots and violin plots of all noted algorithms with respect to the fitness function (Cohen's kappa in the experiments) and the classification error rate. The convergence graphs exhibit the superiority of the HASCA method, as it can be noted that the employed switching mechanism between the SCA and FA search processes significantly contributes to the fast convergence of the suggested approach. Additionally, Fig. 10 shows the swarm plots, which make it possible to estimate the diversity of solutions during the last iteration of the best run for every noted method, with respect to the fitness function and classification error rate. Once more, it can be noticed that all solutions during the last iteration of the HASCA run are placed close to the best individual. Lastly, Fig. 10 also provides kernel density estimation (KDE) diagrams, used to visually show that the simulation results follow the normal distribution. A classification algorithm's performance level is commonly assessed by utilizing a confusion matrix, which visually presents the classification accuracy and errors. Moreover, the precision-recall (PR) and receiver operating characteristic (ROC) curves are two very important visual tools for classification tasks. The area under the PR curve (PR AUC) combines precision and recall within a single plot, while the area under the ROC curve (ROC AUC) shows the trade-off between the true positive and false positive rates. Therefore, to further visualize the performance of the proposed CNN-ELM-HASCA approach on the Win 7 dataset, Fig. 11 displays the confusion matrix, while Fig. 12 depicts the PR AUC and ROC AUC curves achieved by the suggested method.

Windows 10 dataset experimental results
Table 6 displays the simulation outcomes attained by the regarded models over the Windows 10 dataset, with respect to the fitness function (Cohen's kappa score) over 15 independent executions. The suggested CNN-ELM-HASCA model once again attained the best results, outclassing every other contender model with respect to all observed metrics: the best, worst, mean and median values, as well as the standard deviation and variance scores. Simply put, the suggested model in this case not only acquired the best results, it also delivered a consistent performance level over the independent executions, constantly providing results near the mean value. CNN-ELM-ChOA secured the second place, while   Table 6 contains the determined count of neurons nn in ELM. Since the range of search for nn was set to $[300, 600]$, it is once again obvious that almost all models converged to the upper limit. However, since the performance of the model significantly depends on the weights and biases, the proposed CNN-ELM-HASCA attains the best results with 539 neurons, compared to CNN-ELM-WOA and CNN-ELM-ChOA (536 and 532 neurons, respectively) and to other algorithms that produced models with more neurons. With regard to the classification error rate achieved on the Windows 10 dataset, Table 7 summarizes the scores for each of the contending models. Similarly to the fitness function scores, the proposed CNN-ELM-HASCA model once more outclassed every other contender with respect to all observed metrics (the best, worst, mean and median values, as well as the standard deviation and variance values), leaving behind the CNN-ELM-ChOA and CNN-ELM-SCA models. Table 8 encapsulates the comprehensive metrics attained by the best run of every regarded model. It is worth noting that the suggested CNN-ELM-HASCA model once more attained the highest accuracy of 96.65%, ahead of CNN-ELM-ChOA with the second best accuracy of 96.64% and CNN-ELM-SCA with the third best accuracy of 96.62%.
The suggested CNN-ELM-HASCA method has also displayed supremacy when  Figure 13 shows the convergence graphs, box plots and violin plots of all noted algorithms with respect to the fitness function (Cohen's kappa in the experiments) and the classification error rate. The convergence graphs again exhibit the superiority of the HASCA method, where it is obvious that the employed switching mechanism between the SCA and FA search processes significantly contributes to the fast convergence of the suggested approach. Additionally, Fig. 14 shows the swarm plots, which make it possible to estimate the diversity of solutions during the last iteration of the best run for every noted method, with respect to the fitness function and classification error rate. Once more, it can be noticed that all solutions during the last iteration of the HASCA run are placed close to the best individual. Finally, Fig. 14 also provides KDE diagrams, showing that the simulation results follow the normal distribution.
To further visualize the performance of the proposed CNN-ELM-HASCA approach on Win 10 dataset, Fig. 15

Statistical tests and model validation
In order to verify the experimental outcomes and establish whether they are statistically significant, the best results from each of the fifteen executions of every considered algorithm, for both considered problem cases (Win 7 and Win 10 datasets), were collected and investigated as data series. First, it is necessary to establish which type of statistical test is appropriate, parametric or non-parametric. Before resorting to non-parametric tests, it must be checked whether the parametric tests can be used, by examining independence, normality, and homoscedasticity of the data variances (LaTorre et al., 2021). The first requirement, independence, is fulfilled as each execution of every algorithm starts by generating a set of pseudo-random variables. The second requirement, homoscedasticity, has been validated by performing Levene's test (Glass, 1966), and as a p-value of 0.67 was determined in every case, one can conclude that the homoscedasticity requirement has also been satisfied. To verify the last requirement, normality, the Shapiro-Wilk single problem analysis has been applied (Shapiro & Francia, 1972). The Shapiro-Wilk p-values have been established independently for each of the regarded algorithms. The determined p-values for each approach were larger than 0.05, allowing the conclusion that the H0 hypothesis cannot be rejected for both $\alpha = 0.05$ and $\alpha = 0.1$. Consequently, the observed results originate from the normal distribution. The same conclusion can be reached by simply looking at the KDE diagrams in Figs. 10 and 14. The Shapiro-Wilk test values are given in Table 9. As the normality condition is fulfilled as well, the parametric test can be safely utilized.
Within this manuscript, the paired t-test has been employed (Hsu & Lachenbruch, 2014), since it is a common choice when metaheuristics-based algorithms are evaluated (Chen et al., 2014). The paired t-test may be employed if the collection of data values can be observed as paired measurements, where the distribution of differences between the pairs is required to follow the normal distribution as well. Simply put, the differences among the samples of each pair of metaheuristics should be normally distributed. To examine this, the absolute differences among the distributions of the suggested method and the other contenders were determined, followed by another application of the Shapiro-Wilk test on each absolute difference. The outcomes of the Shapiro-Wilk test have shown that the p-values in all instances were larger than the threshold value 0.05, allowing the conclusion that the H0 hypothesis cannot be rejected for $\alpha = 0.05$, which means that the observed values follow the normal distribution. As this is the precondition for the paired t-test, the test can be safely used to compare the suggested method to each of the contenders. The Shapiro-Wilk p-values established on the differences between the suggested approach and the other contenders, followed by the paired t-test outcomes, are shown in Table 10. In the case of the paired t-test, the p-values are smaller than 0.05 for all algorithms excluding SCA and ChOA with respect to the Win 10 dataset (0.071 with SCA and 0.076 with ChOA). Accordingly, it can be established that the introduced HASCA method is significantly superior to all contenders for the threshold $\alpha = 0.1$, and significantly superior to all contenders excluding the SCA and ChOA algorithms when the threshold value $\alpha = 0.05$ is observed.
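The testing sequence described above (Levene, Shapiro-Wilk, paired t-test) can be sketched with SciPy. The two samples are synthetic placeholders for 15 per-run scores of two algorithms and do not reproduce the values in Tables 9 and 10; the sketch also checks normality on the signed paired differences, which is an assumption that differs slightly from the absolute differences mentioned in the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic stand-ins for the best score of each of 15 independent runs
proposed = rng.normal(0.975, 0.002, size=15)
rival = rng.normal(0.970, 0.002, size=15)

# Homoscedasticity: Levene's test on the two samples
_, levene_p = stats.levene(proposed, rival)

# Normality of the paired differences: Shapiro-Wilk test
_, shapiro_p = stats.shapiro(proposed - rival)

# Paired t-test comparing the two algorithms run by run
t_stat, t_p = stats.ttest_rel(proposed, rival)

alpha = 0.05
parametric_ok = levene_p > alpha and shapiro_p > alpha  # preconditions not rejected
significant = t_p < alpha                                # difference is significant
```

A large Levene or Shapiro-Wilk p-value only means the corresponding assumption is not rejected; it is the paired t-test p-value that is compared against the chosen threshold to claim a significant difference.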
One of the most important tasks when analysing the results of machine learning models is to interpret them properly, aiming to discover which features are the most influential with respect to the target variable. Proper interpretation allows decision makers to decide more confidently, which can be vital in the network security area. To explain the behavior of the model observed in this research, the advanced explainable AI method SHAP was utilized, allowing a better understanding of the simulation results. The SHAP procedure allows easy and fitting interpretations of the predictions made by the observed model by measuring the importance of each feature, inspired by game theory (Lundberg & Lee, 2017). Simply put, the collection of Shapley values represents the payout distributed between the features, according to every feature's contribution towards the joint payout (denoting the prediction in this case). Finally, the SHAP method supplements each feature with an importance indicator that measures the contribution of that feature to the specific prediction.
Figure 17 presents the SHAP summary plots, which allow analysing the influence of features on the output classes, for both the Windows 7 and Windows 10 datasets. Moreover, Fig. 18 provides simple SHAP waterfall plots, showing the extent to which features affect observation 8 with respect to class 0 (normal). Figure 19 depicts how features influence class 0 (normal traffic), class 1 (dos) and class 4 (backdoor), with respect to the experiments with the Windows 7 dataset. Similarly, Fig. 20 displays the effect of features on class 0 (normal traffic), class 1 (dos), class 2 (ddos) and class 5 (password) for the experiments with the Windows 10 dataset. Intriguingly, the given SHAP visualizations indicate that the last feature (in both the Win 7 and Win 10 experiments) has almost no influence whatsoever, suggesting that it may be removed based on the SHAP analysis. This feature was not removed, in accordance with Moustafa et al. (2020); however, this is an important observation. Moreover, as the examples presented in Figs. 19 and 20 indicate, the effect each feature has on a specific target can be analysed. For instance, increasing the value of the Memory.Pool.Paged Bytes attribute pushes the prediction in that particular case towards class 1 (dos attack). Similarly, a decrease of the attribute Network_I.Intel.R82574L_GNC.Packets Sent.sec, which shows the speed of sending packets through the network interface, increases the effect on the outcome being classified as a DDOS attack.
Regarding the most important attributes, it is possible to conclude that the features with the largest influence in the case of the Win 7 dataset are Process Pool Paged Bytes, Network I. Intel.R Pro 1000MT.Bytes Sent sec, and Process.Total IO Data Operations sec. With respect to the Win 10 dataset, the most important attributes are Memory.Pool.Paged Bytes, Network_I.Intel.R82574L_GNC.Packets Sent.sec, and Process.Total IO Data Operations sec. Generally speaking, the SHAP analysis clearly indicates that the possibility of a real network attack is high if problems such as increased virtual memory utilization and physical memory paging, or reduced speed of read and write procedures during I/O operations, have been noticed. This is also in accordance with real-world experience, as it has been confirmed in practice countless times.
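The idea of distributing a prediction between features as Shapley values can be illustrated with an exact computation on a toy model. This sketch does not use the SHAP library or the model from this research: it enumerates all feature coalitions for a hypothetical three-feature linear model, where absent features are replaced by a zero baseline. All names and numbers are invented for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition of the other features."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [f for f in range(n_features) if f != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley formula
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when joining coalition s
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

# Toy model: the third feature has zero weight and therefore no influence
coefficients = [2.0, 3.0, 0.0]
instance = [1.0, 2.0, 5.0]
baseline = [0.0, 0.0, 0.0]

def value_fn(coalition):
    """Model output when only the features in the coalition take the instance values."""
    inputs = [instance[k] if k in coalition else baseline[k] for k in range(3)]
    return sum(c * x for c, x in zip(coefficients, inputs))

phi = shapley_values(value_fn, 3)
total_effect = value_fn(frozenset({0, 1, 2})) - value_fn(frozenset())
```

The values sum to the difference between the prediction and the baseline output, and a feature the model ignores receives a value of zero, which is the kind of signal that suggests a feature may be removed.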

CONCLUSION
The intrusion detection problem in IoT networks is crucial, as unauthorized access or compromised data could lead to leaking of private information, reputation loss or even human casualties. To address this problem and keep IoT networks secure, it is necessary to quickly and consistently differentiate between malicious actions and regular activities. The research presented in this manuscript introduces a novel hybrid intrusion detection structure that can be utilized for this crucial task. The suggested approach relies on the novel HASCA method, which was developed by modifying the elementary SCA metaheuristic and incorporating the FA search mechanism. The basic SCA algorithm has powerful exploration; however, it does not have sufficient exploitation capabilities. Creating a low-level hybrid with FA, which is known for its strong exploitation, is a logical choice, where the advantages of both algorithms can mutually overcome their respective drawbacks. The HASCA algorithm begins execution by using the basic SCA search mechanism; however, in later stages, it alternates between the SCA and FA search procedures to enhance the exploitation.
This novel HASCA metaheuristic was used within the hybrid ML framework, which consists of the lightweight CNN and the ELM model, where HASCA was used to tune the ELM's structure (number of neurons in its single hidden layer), as well as to determine the weights and biases between neurons. The introduced framework was named CNN-ELM-HASCA, and its performance was validated on two intrusion detection benchmark instances (Win 7 and Win 10 datasets). The attained experimental outcomes were compared to the results achieved by seven other contending metaheuristic algorithms, tested as part of the identical framework and utilized to tune the ELM for the observed task. The proposed CNN-ELM-HASCA attained the highest accuracy of 98.67% and 96.65% on the Win 7 and Win 10 datasets, respectively.
However, as in any other research work, the proposed study outlines some limitations. First of all, the CNN structure used for feature extraction was determined manually, by 'trial and error' approach. Automatic evolving of CNN's structure (set of hyperparameters' values), e.g., by utilizing metaheuristics, is very resource-intensive and it would require additional time and computing resources. However, this could be a promising topic for future research in the area. Secondly, introduced framework was evaluated on only two multi-class datasets and thus, it may require further evaluation on a wider set of benchmarking data. Finally, the SCA metaheuristics could also be further improved by investigating hybridization with other metaheuristics that show good exploitation abilities.
Regardless of the above-mentioned limitations, the experimental outcomes presented in this study are very encouraging, and future experiments will be focused on gaining further confidence in the suggested CNN-ELM-HASCA model. This will include validation on supplementary real-world datasets, prior to possible implementation as a part of a real IDS. Also, further research may turn towards tuning of the CNN structure along with the ELM for this important challenge, as a part of a two-level framework with CNN tuning in the first layer and the ELM optimization in the second one. Additionally, since there is still a significant research gap in this domain, with numerous ML/DL models and metaheuristics available in the modern literature, future research may also be focused on experimentation with various combinations of ML/DL models and metaheuristics for this significant IoT security challenge.

ADDITIONAL INFORMATION AND DECLARATIONS
Funding
This research is funded by the Universiti Kebangsaan Malaysia (Grant code: GUP-2022-060). Nor Samsiah Sani and Maifuza Mohd Amin provided funding for the proposed research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.